N-gram language models for document image decoding
Abstract
This paper explores the problem of incorporating linguistic constraints into document image decoding, a communication theory approach to document recognition. Probabilistic character n-grams (n=2–5) are used in a two-pass strategy where the decoder first uses a very weak language model to generate a lattice of candidate output strings. These are then re-scored in the second pass using the full language model. Experimental results based on both synthesized and scanned data show that this approach is capable of improving the error rate by a factor of two to ten depending on the quality of the data and the details of the language model used.
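The second-pass rescoring idea described in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and not the paper's implementation: the `CharNgramLM` class, the add-k smoothing, the `rescore` helper, the `lm_weight` parameter, and the toy corpus are all hypothetical. A flat list of candidate strings with first-pass log scores also stands in for the lattice, which is a simplification.

```python
import math
from collections import Counter


class CharNgramLM:
    """Character n-gram model with add-k smoothing.

    Hypothetical illustration only: the smoothing scheme and the padding
    symbols are assumptions, not details taken from the paper.
    """

    def __init__(self, n, k=1.0):
        self.n = n
        self.k = k
        self.ngrams = Counter()    # (context, next_char) -> count
        self.contexts = Counter()  # context -> count
        self.vocab = set()

    def _pad(self, text):
        # "^" marks the start of a string and "$" marks its end.
        return "^" * (self.n - 1) + text + "$"

    def train(self, corpus):
        for line in corpus:
            padded = self._pad(line)
            self.vocab.update(padded)
            for i in range(self.n - 1, len(padded)):
                ctx = padded[i - self.n + 1:i]
                self.ngrams[(ctx, padded[i])] += 1
                self.contexts[ctx] += 1

    def logprob(self, text):
        padded = self._pad(text)
        v = len(self.vocab) + 1  # reserve one slot for unseen characters
        total = 0.0
        for i in range(self.n - 1, len(padded)):
            ctx = padded[i - self.n + 1:i]
            num = self.ngrams[(ctx, padded[i])] + self.k
            den = self.contexts[ctx] + self.k * v
            total += math.log(num / den)
        return total


def rescore(candidates, lm, lm_weight=1.0):
    """Pick the best candidate from (text, first_pass_log_score) pairs.

    The combined score adds the weighted language-model log probability
    to the first-pass score; the weighting is an assumption.
    """
    best = max(candidates, key=lambda c: c[1] + lm_weight * lm.logprob(c[0]))
    return best[0]


# Toy usage: the first pass slightly prefers a misrecognized string,
# and the trigram model is then applied to re-rank the candidates.
lm = CharNgramLM(n=3)
lm.train(["the cat sat on the mat", "the dog sat on the log"])
candidates = [("tne cat", -0.9), ("the cat", -1.0)]
best = rescore(candidates, lm)
```

In a real decoder the candidates would come from a lattice and the rescoring would be done with dynamic programming over its paths, so this n-best version only conveys the scoring step.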
Similar resources
Converting Continuous-Space Language Models into N-Gram Language Models for Statistical Machine Translation
Neural network language models, or continuous-space language models (CSLMs), have been shown to improve the performance of statistical machine translation (SMT) when they are used for reranking n-best translations. However, CSLMs have not been used in the first pass decoding of SMT, because using CSLMs in decoding takes a lot of time. In contrast, we propose a method for converting CSLMs into b...
Approximated and Domain-Adapted LSTM Language Models for First-Pass Decoding in Speech Recognition
Traditionally, short-range Language Models (LMs) like the conventional n-gram models have been used for language model adaptation. Recent work has improved performance for such tasks using adapted long-span models like Recurrent Neural Network LMs (RNNLMs). With the first pass performed using a large background n-gram LM, the adapted RNNLMs are mostly used to rescore lattices or N-best lists, a...
Coarse-to-Fine Syntactic Machine Translation using Language Projections
The intersection of tree transducer-based translation models with n-gram language models results in huge dynamic programs for machine translation decoding. We propose a multipass, coarse-to-fine approach in which the language model complexity is incrementally introduced. In contrast to previous order-based bigram-to-trigram approaches, we focus on encoding-based methods, which use a clustered en...
The LIMSI 1999 Hub-4E Transcription System
In this paper we report on the LIMSI 1999 Hub-4E system for broadcast news transcription. The main difference from our previous broadcast news transcription system is that a new decoder was implemented to meet the 10xRT requirement. This single pass 4-gram dynamic network decoder is based on a time-synchronous Viterbi search with dynamic expansion of LM-state conditioned lexical trees, and with...
Higher Order n-gram Language Models for Arabic Diacritics Restoration
Dynamic programming-based Arabic diacritics restoration aims to assign diacritics to Arabic words. The technique is a purely statistical approach and depends only on an Arabic corpus annotated with diacritics. The possible word sequences with diacritics are assigned scores using a statistical n-gram language modeling approach. Using the assigned scores, it is possible to search the most likely sequ...